1) Clustering and PCA

K-means Clustering

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0        1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00      Median :118.0        Median :0.9949  
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7        Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9970  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   Mean   :5.818  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000  
##     color          
##  Length:6497       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

PCA

                            PC1        PC2        PC3        PC4        PC5        PC6
fixed.acidity        -0.2387989  0.3363545 -0.4343013  0.1643462 -0.1474804 -0.2045537
volatile.acidity     -0.3807575  0.1175497  0.3072594  0.2127849  0.1514560 -0.4921431
citric.acid           0.1523884  0.1832994 -0.5905697 -0.2643003 -0.1553487  0.2276338
residual.sugar        0.3459199  0.3299142  0.1646884  0.1674430 -0.3533619 -0.2334778
chlorides            -0.2901126  0.3152580  0.0166791 -0.2447439  0.6143911  0.1609764
free.sulfur.dioxide   0.4309140  0.0719326  0.1342239 -0.3572789  0.2235323 -0.3400514
## Importance of first k=6 (out of 11) components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.7407 1.5792 1.2475 0.98517 0.84845 0.77930
## Proportion of Variance 0.2754 0.2267 0.1415 0.08823 0.06544 0.05521
## Cumulative Proportion  0.2754 0.5021 0.6436 0.73187 0.79732 0.85253
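The loadings and the variance table above come from a standard principal-components fit; a minimal sketch with base R's `prcomp`, using a small synthetic stand-in (means and spreads loosely matched to the summary above) since the wine CSV isn't reproduced here:

```r
# Synthetic stand-in for the numeric wine columns -- not the real data.
set.seed(1)
wine_num = data.frame(
  fixed.acidity    = rnorm(100, 7.2, 1.3),
  volatile.acidity = rnorm(100, 0.34, 0.16),
  residual.sugar   = rnorm(100, 5.4, 4.8),
  alcohol          = rnorm(100, 10.5, 1.2)
)
# PCA on centered, scaled measurements
pca = prcomp(wine_num, center = TRUE, scale. = TRUE)
pca$rotation              # loadings, as in the table above
summary(pca)$importance   # std dev / proportion / cumulative proportion rows
```

On the real data, `pca$rotation[, 1:6]` and `summary(pca)$importance[, 1:6]` would reproduce the first six components shown.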

2) Market segmentation

The method used here is K-means clustering. With it, we can detect interesting, distinctive market segments within NutrientH20’s social-media audience.

First, read social_marketing.csv and take a look at the 36 different categories of interest:
chatter current_events travel photo_sharing uncategorized tv_film
sports_fandom politics food family home_and_garden music
news online_gaming shopping health_nutrition college_uni sports_playing
cooking eco computers business outdoors crafts
automotive art religion beauty parenting dating
school personal_fitness fashion small_business spam adult

With K-means clustering, we can group these categories; for example, “physical wellness” could be one cluster containing personal_fitness, health_nutrition, and outdoors.

As mentioned in the question, there are categories like spam, adult, and uncategorized; I will remove spam and adult to clean the dataset.

# drop the first column, a random 9-digit code labeling each user
X = data[,-1]
# drop the spam and adult columns, which slipped through the data collection
X = X[,-(35:36)]
# center and scale
X = scale(X, center=TRUE, scale=TRUE)
# extract the centers and scales from the rescaled data (stored as named attributes)
mu = attr(X, "scaled:center")
sigma = attr(X, "scaled:scale")

After cleaning, centering, and scaling the data, I will start with a correlation plot, since it visualizes which categories in the dataset are strongly correlated with each other and helps identify categories that move together.
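The pair-sorting step that follows can be sketched in base R; `X` here is a synthetic stand-in for the scaled category matrix, and the long-format output matches the Var1/Var2/Freq table below:

```r
# Compute pairwise correlations and list the strongest off-diagonal pairs.
set.seed(2)
X = scale(matrix(rnorm(200 * 5), 200, 5))   # stand-in for the scaled categories
colnames(X) = paste0("cat", 1:5)
cors = as.data.frame(as.table(cor(X)))      # long format: Var1, Var2, Freq
cors = cors[as.integer(cors$Var1) < as.integer(cors$Var2), ]  # drop self/duplicate pairs
head(cors[order(-cors$Freq), ], 10)         # highest correlations first
```

The visual version would be a call like `corrplot::corrplot(cor(X))`, assuming the corrplot package is available.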

As there are so many variables, I will sort the category pairs by correlation and show the highest ones.
Highest correlations among categories
Var1               Var2               Freq
health_nutrition   personal_fitness   0.8099024
online_gaming      college_uni        0.7728393
cooking            fashion            0.7214027
cooking            beauty             0.6642389
travel             politics           0.6602100
religion           parenting          0.6555973
sports_fandom      religion           0.6379748
beauty             fashion            0.6349739
health_nutrition   outdoors           0.6082254
sports_fandom      parenting          0.6077181
travel             computers          0.6029349

As you can see, health_nutrition and personal_fitness have the highest correlation; these are exactly the categories I expected to fall in the same cluster, “physical wellness”.

Having previewed the data with the correlation plot, we can now move on to the K-means analysis itself.

K-means clustering

First, we choose the optimal K, the number of clusters, using an elbow plot. The elbow plot is used to determine the optimal number of clusters: it displays the within-cluster sum of squares (WSS) as a function of the number of clusters, and the “elbow” marks the point where adding more clusters stops paying off.
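The elbow plot can be generated by fitting K-means over a range of K and recording the total WSS each time; a sketch on a synthetic stand-in for the scaled matrix:

```r
# Total within-cluster sum of squares for K = 1..15 (elbow plot).
set.seed(3)
X = scale(matrix(rnorm(300 * 4), 300, 4))   # stand-in for the scaled data
wss = sapply(1:15, function(k)
  kmeans(X, centers = k, nstart = 25, iter.max = 50)$tot.withinss)
plot(1:15, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```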

10 seems to be the elbow point, so we’ll use K = 10.

We can get surface-level information about market segments for NutrientH20.

cluster1
  dating             9.293814
  chatter            7.963917
  photo_sharing      2.639175
  fashion            2.505155
  school             2.257732

cluster2
  chatter            9.731328
  photo_sharing      5.963693
  shopping           4.120332
  current_events     1.992739
  health_nutrition   1.600622

cluster3
  health_nutrition  12.591270
  personal_fitness   6.661376
  chatter            3.767196
  cooking            3.414021
  outdoors           2.903439

cluster4
  politics          11.267241
  travel             9.103448
  computers          4.100575
  chatter            4.060345
  news               3.617816

cluster5
  tv_film            5.597087
  art                5.038835
  chatter            3.929612
  college_uni        2.548544
  photo_sharing      2.453883

cluster6
  news               6.825059
  politics           5.517730
  automotive         4.392435
  chatter            4.115839
  sports_fandom      3.061466

cluster7
  cooking           11.788009
  photo_sharing      6.083512
  fashion            5.995717
  beauty             4.218415
  chatter            4.194861

cluster8
  college_uni       11.098870
  online_gaming     10.850283
  chatter            4.096045
  sports_playing     2.745763
  photo_sharing      2.655367

cluster9
  sports_fandom      6.196347
  religion           5.557078
  food               4.727550
  parenting          4.258752
  chatter            3.849315

cluster10
  chatter            3.080133
  photo_sharing      1.548533
  current_events     1.264893
  health_nutrition   1.143030
  travel             1.084971
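Per-cluster rankings like these can be read off the fitted centroids once they are mapped back to the original scale with the saved `mu` and `sigma`; a sketch on synthetic stand-in data (the real social_marketing matrix isn't reproduced here):

```r
# Fit K-means, un-scale the centroids, and print each cluster's top categories.
set.seed(4)
raw = matrix(rpois(500 * 6, 3), 500, 6,
             dimnames = list(NULL, paste0("cat", 1:6)))  # stand-in counts
X = scale(raw, center = TRUE, scale = TRUE)
mu = attr(X, "scaled:center")
sigma = attr(X, "scaled:scale")

fit = kmeans(X, centers = 3, nstart = 25)
# centroids back on the original scale: multiply by sigma, add mu
centers_orig = sweep(sweep(fit$centers, 2, sigma, "*"), 2, mu, "+")
for (i in 1:nrow(centers_orig)) {
  cat("cluster", i, "\n")
  print(sort(centers_orig[i, ], decreasing = TRUE)[1:5])
}
```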

3) Association rules for grocery purchases

Analysis

The groceries.txt file contains a total of 9,835 unique shopping baskets. We first went through some data wrangling before conducting Market Basket Analysis using the “arules” package. As for the thresholds, we chose a support of .001, a confidence of .5, and a maxlen of 10. A relatively low support of .001 was chosen because we wanted to capture as many items as possible from the dataset. A confidence of .5 was chosen to sort out weak associations. Lastly, we limited the maximum number of items per item set to 10 to account for as many grocery combinations as possible. Running the algorithm with these thresholds resulted in 5,668 rules, which we thought was enough for this analysis. Below are two plots showing the resulting rules; the first is plotted between support and lift, while the second is between support and confidence.
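The mining call that produces the output below could look like the following sketch, with the thresholds stated above; a tiny synthetic basket list stands in for groceries.txt, which isn't reproduced here:

```r
library(arules)

# Tiny synthetic basket list standing in for groceries.txt (not the real data).
baskets = list(
  c("whole milk", "bread"),
  c("whole milk", "bread", "butter"),
  c("soda", "popcorn", "salty snack"),
  c("whole milk", "butter"),
  c("soda", "popcorn")
)
trans = as(baskets, "transactions")

# Same thresholds as in the analysis: support .001, confidence .5, maxlen 10.
rules = apriori(trans, parameter = list(support = 0.001,
                                        confidence = 0.5, maxlen = 10))
```

The rule tables further down can then be produced with `inspect(head(sort(rules, by = "confidence"), 10))` and likewise with `by = "lift"`.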

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Below is a table that shows the top ten rules with the highest confidence. Confidence is the probability of having the item(s) on the RHS given that those on the LHS are purchased. Out of the top ten rules, the most frequent RHS items are whole milk and other vegetables. However, this does not reveal much about association. Take {canned fish,hygiene articles} -> {whole milk} as an example. Intuitively, buying canned fish and hygiene articles doesn’t seem to have anything to do with buying whole milk, yet this rule still sits at the top of the list simply because whole milk is bought so frequently, regardless of what other items people purchase. To see more relevant association rules, let’s look at a list sorted by lift.

Top 10 rules with the highest confidence
LHS RHS support confidence coverage lift count
{rice,sugar} {whole milk} 0.0012200 1 0.0012200 3.914047 12
{canned fish,hygiene articles} {whole milk} 0.0011183 1 0.0011183 3.914047 11
{butter,rice,root vegetables} {whole milk} 0.0010167 1 0.0010167 3.914047 10
{flour,root vegetables,whipped/sour cream} {whole milk} 0.0017283 1 0.0017283 3.914047 17
{butter,domestic eggs,soft cheese} {whole milk} 0.0010167 1 0.0010167 3.914047 10
{citrus fruit,root vegetables,soft cheese} {other vegetables} 0.0010167 1 0.0010167 5.168681 10
{butter,hygiene articles,pip fruit} {whole milk} 0.0010167 1 0.0010167 3.914047 10
{hygiene articles,root vegetables,whipped/sour cream} {whole milk} 0.0010167 1 0.0010167 3.914047 10
{hygiene articles,pip fruit,root vegetables} {whole milk} 0.0010167 1 0.0010167 3.914047 10
{cream cheese ,domestic eggs,sugar} {whole milk} 0.0011183 1 0.0011183 3.914047 11

Below is a table showing the top ten rules with the highest lift. Lift differs from confidence in that it is the ratio of confidence to expected confidence. In other words, lift measures the relative strength of association between LHS and RHS, which takes care of the high-frequency issue with whole milk purchases we observed above. Lift > 1 indicates that the rule improves the chances of the outcome, whereas lift < 1 indicates that it lowers them; lift = 1 has no effect on the outcome. The result here is much more interesting and informative.
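As a quick arithmetic check of the definition, lift = confidence / support(RHS), so the whole-milk rows of the confidence table above pin down how often whole milk appears overall:

```r
# {rice,sugar} -> {whole milk}: confidence 1, lift 3.914047 (from the table).
# Since lift = confidence / support(RHS), support(whole milk) follows:
conf = 1
lift = 3.914047
support_rhs = conf / lift
round(support_rhs, 4)   # roughly a quarter of all baskets contain whole milk
```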

{popcorn,soda} -> {salty snack} Here, it seems like people are getting ready for a movie night. People who buy popcorn and soda are likely to buy other salty snacks. Thus, the model makes sense.

{ham,processed cheese} -> {white bread} These are ingredients for a quick sandwich. Hence, the rule makes sense again.

Top 10 rules with the highest lift
LHS RHS support confidence coverage lift count
{Instant food products,soda} {hamburger meat} 0.0012200 0.6315789 0.0019317 18.99759 12
{popcorn,soda} {salty snack} 0.0012200 0.6315789 0.0019317 16.69949 12
{baking powder,flour} {sugar} 0.0010167 0.5555556 0.0018300 16.40974 10
{ham,processed cheese} {white bread} 0.0019317 0.6333333 0.0030500 15.04702 19
{Instant food products,whole milk} {hamburger meat} 0.0015250 0.5000000 0.0030500 15.03976 15
{curd,other vegetables,whipped/sour cream,yogurt} {cream cheese } 0.0010167 0.5882353 0.0017283 14.83560 10
{domestic eggs,processed cheese} {white bread} 0.0011183 0.5238095 0.0021350 12.44490 11
{other vegetables,tropical fruit,white bread,yogurt} {butter} 0.0010167 0.6666667 0.0015250 12.03180 10
{hamburger meat,whipped/sour cream,yogurt} {butter} 0.0010167 0.6250000 0.0016267 11.27982 10
{domestic eggs,other vegetables,tropical fruit,whole milk,yogurt} {butter} 0.0010167 0.6250000 0.0016267 11.27982 10

The last plot is a graph visualization of the association rules. Each item in the LHS is connected to the RHS item by a directed edge indicating the direction of the relationship.

# graph-based visualization: associations are represented as edges.
# For rules, each item in the LHS is connected with a directed edge
# to the item in the RHS.
groceries_graph = associations2igraph(subset)

# export for external tools such as Gephi
igraph::write_graph(groceries_graph, file = 'groceries.graphml', format = "graphml")

# render the rule graph in R via arulesViz; note that grViz() expects
# Graphviz DOT input, not GraphML, so it cannot read the exported file
plot(subset, method = "graph")